The Scoring Trick

AesCLIP (ACM Multimedia 2023) builds on OpenAI's CLIP, specifically ViT-B/16, with fine-tuned weights for aesthetic assessment. The mechanism is dead simple: give CLIP two text prompts, "good image" and "bad image," compute how similar the photo's embedding is to each, and the relative similarity becomes a 0-to-10 score. No classification head, no regression layer, just the geometry of CLIP's shared vision-language space.
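To make the geometry concrete, here is a minimal sketch of the two-prompt scoring step, written against plain lists of floats so it runs without any model weights. It assumes the "relative similarity" is a two-way softmax over cosine similarities and that the result is scaled linearly to 0-10; the function names and the temperature of 100 (a typical CLIP logit scale) are my assumptions, not details confirmed by the AesCLIP code.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def aesthetic_score(image_emb, good_emb, bad_emb, temperature=100.0):
    """Map an image embedding to a 0-10 score using two prompt embeddings.

    Assumption: the score is the softmax probability of the "good image"
    prompt, scaled to 0-10. The temperature value is illustrative.
    """
    sim_good = cosine(image_emb, good_emb)
    sim_bad = cosine(image_emb, bad_emb)
    # A two-way softmax reduces to a sigmoid of the scaled similarity gap.
    p_good = 1.0 / (1.0 + math.exp(-temperature * (sim_good - sim_bad)))
    return 10.0 * p_good
```

In a real pipeline the three vectors would come from the image and text encoders; an image equally close to both prompts lands at exactly 5.0 under this formulation.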

I was skeptical that something this blunt could match human judgment, but it held up. On my own archive of 80,000+ photos accumulated over fifteen years, the top-rated images are consistently the ones I'd have picked myself. The model has a slight bias toward vivid colors and strong composition, which happens to align with my taste.

80,000 Photos

A model that scores one photo is a toy. I needed something that incrementally processes an entire archive, remembers what it's already seen, and recovers from crashes.

Each input folder gets its own SQLite database storing file paths, scores, sizes, and modification times. Before processing a file, the system checks whether it's already been scored and whether its size or mtime have changed; unchanged files get skipped, so re-running on a folder of 10,000 already-scored photos finishes in seconds. SHA256 content hashes catch duplicates across folders (copies, backups, re-exports) and score them only once.
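The skip-and-deduplicate logic might look roughly like the sketch below. The table layout, column names, and helper functions are all hypothetical, and for brevity it uses a single database rather than one per folder. It stores the modification time in nanoseconds so the comparison is an exact integer match.

```python
import hashlib
import os
import sqlite3

# Hypothetical schema: one row per file path.
SCHEMA = """
CREATE TABLE IF NOT EXISTS photos (
    path     TEXT PRIMARY KEY,
    size     INTEGER NOT NULL,
    mtime_ns INTEGER NOT NULL,
    sha256   TEXT NOT NULL,
    score    REAL
)
"""


def open_db(db_path):
    conn = sqlite3.connect(db_path)
    conn.execute(SCHEMA)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_sha256 ON photos(sha256)")
    return conn


def file_sha256(path, chunk_size=1 << 20):
    """Hash file contents in chunks so large files never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(conn, path):
    """True if the path is recorded with the same size and mtime."""
    st = os.stat(path)
    row = conn.execute(
        "SELECT size, mtime_ns FROM photos WHERE path = ?", (path,)
    ).fetchone()
    return row is not None and row == (st.st_size, st.st_mtime_ns)


def score_for_duplicate(conn, sha256):
    """Return a score already stored for identical content, or None."""
    row = conn.execute(
        "SELECT score FROM photos WHERE sha256 = ? AND score IS NOT NULL LIMIT 1",
        (sha256,),
    ).fetchone()
    return None if row is None else row[0]


def record(conn, path, sha256, score):
    st = os.stat(path)
    conn.execute(
        "INSERT OR REPLACE INTO photos (path, size, mtime_ns, sha256, score) "
        "VALUES (?, ?, ?, ?, ?)",
        (path, st.st_size, st.st_mtime_ns, sha256, score),
    )
    conn.commit()
```

The cheap stat check runs first, so the hash is only computed for files that are new or changed; a hash hit then reuses the stored score instead of sending the image to the GPU.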

Images are resized to 224×224 and processed in batches of 64 to keep GPU utilization high. SQLite writes are committed in batches of 256, which keeps commit overhead low and means a crash mid-run loses at most the last uncommitted batch. A heap-based running median tracks the live score distribution during processing, which is useful for calibrating expectations when you're staring at a progress bar for an hour.
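For readers unfamiliar with the running-median technique, this is a sketch of the standard two-heap approach using Python's heapq. The class name and interface are illustrative, not taken from the project.

```python
import heapq


class RunningMedian:
    """Streaming median using two heaps, O(log n) per insert."""

    def __init__(self):
        self._low = []   # max-heap (values negated): the smaller half
        self._high = []  # min-heap: the larger half

    def add(self, value):
        # Route the value through the low heap, then pass that heap's
        # maximum to the high heap so every low value <= every high value.
        heapq.heappush(self._low, -value)
        heapq.heappush(self._high, -heapq.heappop(self._low))
        # Rebalance so the low heap holds the extra element, if any.
        if len(self._high) > len(self._low):
            heapq.heappush(self._low, -heapq.heappop(self._high))

    def median(self):
        if not self._low:
            raise ValueError("no values added yet")
        if len(self._low) > len(self._high):
            return -self._low[0]
        return (-self._low[0] + self._high[0]) / 2
```

Reading the median is O(1) because it only inspects the heap tops, so it is cheap to refresh on every progress-bar update.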

The CLI

rate <folder>       # Score all photos in a folder
top <folder> -n 50  # Show the 50 highest-rated
slideshow <folder>  # Full-screen slideshow, sorted by score

The CLI is built with Typer, with Rich for progress bars and formatted tables. The slideshow mode is the one I actually use: point it at a folder and it cycles through the best photos. Fifteen years of photography, distilled.

Error Handling

Real photo archives are messy: corrupt JPEGs, truncated files, RAW formats the decoder doesn't understand, HEIC files from iPhones. The system distinguishes three error categories: decode errors get logged and the file marked as unscoreable (processing continues), database errors are fatal (data integrity is non-negotiable), and CUDA errors are fatal (GPU state may be corrupted). A corrupt JPEG shouldn't halt a 10-hour run. A corrupted database should.
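The three-way policy can be sketched as a loop that catches only the recoverable category and lets everything else propagate. DecodeError and GpuError are hypothetical exception classes standing in for whatever the decoder and CUDA runtime actually raise, and score_fn and save_fn are placeholders for the scoring and persistence steps.

```python
import logging
import sqlite3

log = logging.getLogger("rater")


class DecodeError(Exception):
    """Hypothetical: the image file could not be decoded."""


class GpuError(Exception):
    """Hypothetical stand-in for a CUDA failure."""


def process_files(paths, score_fn, save_fn):
    """Score each path; return the list of files that could not be decoded.

    Only DecodeError is handled here. GpuError from score_fn and
    sqlite3.Error from save_fn are deliberately not caught, so they
    abort the run.
    """
    unscoreable = []
    for path in paths:
        try:
            score = score_fn(path)
        except DecodeError as exc:
            log.warning("cannot decode %s: %s", path, exc)
            unscoreable.append(path)
            continue
        save_fn(path, score)
    return unscoreable
```

Keeping the except clause narrow is the point of the design: a broad `except Exception` would quietly turn a fatal database or GPU failure into one more skipped file.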

On Zero-Shot Scoring

AesCLIP doesn't need examples of "good" or "bad" from my specific collection: the "good image" / "bad image" prompt pair works because CLIP's training data already encodes a broad consensus about visual aesthetics. I didn't expect this to generalize as well as it does, and I haven't felt the need for personalized fine-tuning.

The incremental processing layer, the SQLite persistence, the deduplication, and the crash recovery took maybe 20% of development time but account for most of the practical value. Without them this would be a one-off script I ran once and forgot about.

Of 80,000 photos, roughly 500 scored above 7.5. That's 0.6%. The vast majority of casual photographs are, aesthetically speaking, mediocre. The good ones really are special, and now I know which ones they are.